Mutect2 joint somatic variant calling workflow (CRAM)¶

Mutect2JointSomaticWorkflowCram · 1 contributor · 1 version

This workflow uses the capability of Mutect2 to call several samples at the same time and improve recall and accuracy through a joint model. Most of these tools are still in a beta state and not intended for main production (as of 4.1.4.0). There are also some major tweaks we have to make for runtime, as the amount of data might otherwise overwhelm the tools.

Quickstart¶

from janis_core import WorkflowBuilder
from janis_bioinformatics.tools.dawson.workflows.variantcalling.multisample.mutect2.mutect2jointsomaticworkflow_cram import Mutect2JointSomaticWorkflowCram

wf = WorkflowBuilder("myworkflow")

wf.step(
    "mutect2jointsomaticworkflowcram_step",
    Mutect2JointSomaticWorkflowCram(
        normalBams=None,
        tumorBams=None,
        normalName=None,
        biallelicSites=None,
        reference=None,
        panelOfNormals=None,
        germlineResource=None,
    )
)
# Steps are referenced as attributes of the workflow builder.
wf.output("out", source=wf.mutect2jointsomaticworkflowcram_step.out)

OR

Install Janis
Ensure Janis is configured to work with Docker or Singularity.
Ensure all reference files are available:

Note

More information about these inputs is available below.
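As a rough aid for checking that the reference files are in place, the sketch below derives the companion (index) files that the CWL definition on this page lists as secondaryFiles. The function names and the `kind` labels are hypothetical and not part of Janis; the suffix patterns are copied from the CWL, where a leading `^` means the primary file's last extension is removed before the suffix is appended.

```python
from pathlib import Path
from typing import List

# Secondary-file patterns as listed in the CWL definition on this page.
# The dictionary keys are illustrative labels, not Janis type names.
SECONDARY_PATTERNS = {
    "cram": [".crai"],
    "vcf": [".tbi"],
    "reference": [".fai", ".amb", ".ann", ".bwt", ".pac", ".sa", "^.dict"],
}


def expected_secondaries(primary: str, kind: str) -> List[str]:
    """Return the companion paths expected next to `primary`."""
    path = Path(primary)
    companions = []
    for pattern in SECONDARY_PATTERNS[kind]:
        if pattern.startswith("^"):
            # "^.dict": reference.fasta -> reference.dict
            companions.append(str(path.with_suffix(pattern[1:])))
        else:
            # ".fai": reference.fasta -> reference.fasta.fai
            companions.append(str(path) + pattern)
    return companions


def missing_secondaries(primary: str, kind: str) -> List[str]:
    """Return the expected companion paths that do not exist on disk."""
    return [p for p in expected_secondaries(primary, kind) if not Path(p).exists()]
```

Calling `missing_secondaries("reference.fasta", "reference")` before a run would list any index files that still need to be created.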

Generate user input files for Mutect2JointSomaticWorkflowCram:

# user inputs
janis inputs Mutect2JointSomaticWorkflowCram > inputs.yaml

inputs.yaml

biallelicSites: biallelicSites.vcf.gz
germlineResource: germlineResource.vcf.gz
normalBams:
- normalBams_0.cram
- normalBams_1.cram
normalName: <value>
panelOfNormals: panelOfNormals.vcf.gz
reference: reference.fasta
tumorBams:
- tumorBams_0.cram
- tumorBams_1.cram
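The `<value>` placeholder for `normalName` has to be replaced with the sample id recorded in the header of the normal file(s). A minimal sketch for finding that id, assuming the header text has already been obtained (for example with `samtools view -H`): it collects the `SM` tags from the `@RG` lines. The function name and the example header are made up for illustration.

```python
from typing import Set


def sample_names(header_text: str) -> Set[str]:
    """Collect the SM (sample) tags from the @RG lines of a SAM/CRAM header."""
    names = set()
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        # Header fields are tab-separated TAG:VALUE pairs.
        for field in line.split("\t")[1:]:
            if field.startswith("SM:"):
                names.add(field[3:])
    return names


# Hypothetical header with two read groups belonging to the same sample.
example_header = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@RG\tID:lane1\tSM:patient1_normal\tPL:ILLUMINA\n"
    "@RG\tID:lane2\tSM:patient1_normal\tPL:ILLUMINA\n"
)
print(sample_names(example_header))  # {'patient1_normal'}
```

If the normal files yield more than one sample id, the headers need fixing before the run, since `normalName` takes a single string.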

Run Mutect2JointSomaticWorkflowCram with:

janis run [...run options] \
    --inputs inputs.yaml \
    Mutect2JointSomaticWorkflowCram

Information¶

ID:	`Mutect2JointSomaticWorkflowCram`
URL:	No URL to the documentation was provided
Versions:	0.1.1
Authors:	Sebastian Hollizeck
Citations:
Created:	2019-10-30
Updated:	2020-12-10

Outputs¶

name	type	documentation
out	Gzipped<VCF>

Workflow¶

Embedded Tools¶

Create genomic call regions	`CreateCallRegions/v0.1.0`
GatkMutect2	`Gatk4Mutect2_cram/4.1.8.1`
BCFTools: Concat	`bcftoolsConcat/v1.9`
BCFTools: Index	`bcftoolsIndex/v1.9`
GATK4: LearnReadOrientationModel	`Gatk4LearnReadOrientationModel/4.1.8.1`
GATK4: MergeMutectStats	`Gatk4MergeMutectStats/4.1.8.1`
GATK4: GetPileupSummaries	`Gatk4GetPileupSummaries_cram/4.1.8.1`
GATK4: CalculateContamination	`Gatk4CalculateContamination/4.1.8.1`
GATK4: GetFilterMutectCalls	`Gatk4FilterMutectCalls/4.1.8.1`
BCFTools: Normalize	`bcftoolsNorm/v1.9`
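To see how the embedded tools depend on each other, the step names and connections can be read off the WDL and CWL definitions further down this page. The sketch below encodes those dependencies and prints one valid execution order; it is only an illustration of the data flow (it needs Python 3.9+ for `graphlib`) and says nothing about how a workflow engine actually schedules the steps.

```python
from graphlib import TopologicalSorter

# step -> steps whose outputs it consumes, as wired in the WDL/CWL on this page
DEPENDS_ON = {
    "createCallRegions": set(),
    "mutect2": {"createCallRegions"},          # scattered over the call regions
    "concat": {"mutect2"},
    "indexUnfiltered": {"concat"},
    "learn": {"mutect2"},
    "mergeMutect2": {"mutect2"},
    "pileup": set(),                           # only needs workflow inputs
    "contamination": {"pileup"},
    "filtering": {"contamination", "mergeMutect2", "learn", "indexUnfiltered"},
    "normalise": {"filtering"},
    "indexFiltered": {"normalise"},
}

# static_order() yields every step after all of its dependencies.
order = list(TopologicalSorter(DEPENDS_ON).static_order())
print(" -> ".join(order))
```

The graph shows that the pileup and contamination branch is independent of the scattered Mutect2 calls, so the two branches only meet at the filtering step.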

Additional configuration (inputs)¶

name	type	documentation
normalBams	Array<CramPair>	The bams that make up the normal sample. Generally Mutect will expect one bam per sample, but as long as the sample ids in the bam header are set appropriately, multiple bams per sample will work.
tumorBams	Array<CramPair>	The bams that contain the tumour samples. Generally Mutect will expect one bam per sample, but as long as the sample ids in the bam header are set appropriately, multiple bams per sample will work.
normalName	String	The sample id of the normal sample. This id will be used to distinguish reads from this sample from all other samples. This id needs to be the one set in the bam header.
biallelicSites	Gzipped<VCF>	A vcf of common biallelic sites from a population. This will be used to estimate sample contamination.
reference	FastaWithIndexes	A fasta- and dict-indexed reference, which needs to be the reference the bams were aligned to.
panelOfNormals	Gzipped<VCF>	The panel of normals, which summarises the technical and biological sites of errors. It is usually a good idea to generate this for your own cohort, but GATK suggests around 30 normals for this, so the panel GATK provides is usually a good choice.
germlineResource	Gzipped<VCF>	Vcf of germline variants. GATK provides this as well, but it can easily be substituted with the newest gnomAD (or similar) vcf.
regionSize	Optional<Integer>	The size of the regions over which to parallelise the analysis. This should be adjusted if there are lots of samples or a very high sequencing depth. Default: 10M bp.
createCallRegions_equalize	Optional<Boolean>
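To illustrate what `regionSize` controls, here is a minimal sketch of cutting one contig into call regions that can then be processed in parallel. This is not the CreateCallRegions implementation: the function name, the 1-based inclusive coordinates, and in particular the meaning given to `equalize` (spreading the bases evenly over the regions instead of leaving a short final region) are assumptions made for illustration only.

```python
from typing import List, Tuple


def split_contig(
    name: str, length: int, region_size: int = 10_000_000, equalize: bool = True
) -> List[Tuple[str, int, int]]:
    """Split one contig into 1-based, inclusive (name, start, end) regions."""
    if length <= 0 or region_size <= 0:
        raise ValueError("length and region_size must be positive")
    n_regions = -(-length // region_size)  # ceiling division
    if equalize:
        # Assumed behaviour: distribute the bases as evenly as possible.
        base, extra = divmod(length, n_regions)
        sizes = [base + 1 if i < extra else base for i in range(n_regions)]
    else:
        # Full-size regions followed by whatever remains.
        sizes = [region_size] * (n_regions - 1)
        sizes.append(length - region_size * (n_regions - 1))
    regions = []
    start = 1
    for size in sizes:
        regions.append((name, start, start + size - 1))
        start += size
    return regions


# A hypothetical 25 Mbp contig with the default 10 Mbp region size.
for region in split_contig("chr1", 25_000_000):
    print(region)
```

A smaller `region_size` produces more and shorter regions, which is the adjustment the `regionSize` documentation suggests for many samples or very high sequencing depth.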

Workflow Description Language¶

version development

import "tools/CreateCallRegions_v0_1_0.wdl" as C
import "tools/Gatk4Mutect2_cram_4_1_8_1.wdl" as G
import "tools/bcftoolsConcat_v1_9.wdl" as B
import "tools/bcftoolsIndex_v1_9.wdl" as B2
import "tools/Gatk4LearnReadOrientationModel_4_1_8_1.wdl" as G2
import "tools/Gatk4MergeMutectStats_4_1_8_1.wdl" as G3
import "tools/Gatk4GetPileupSummaries_cram_4_1_8_1.wdl" as G4
import "tools/Gatk4CalculateContamination_4_1_8_1.wdl" as G5
import "tools/Gatk4FilterMutectCalls_4_1_8_1.wdl" as G6
import "tools/bcftoolsNorm_v1_9.wdl" as B3

workflow Mutect2JointSomaticWorkflowCram {
  input {
    Array[File] normalBams
    Array[File] normalBams_crai
    Array[File] tumorBams
    Array[File] tumorBams_crai
    String normalName
    File biallelicSites
    File biallelicSites_tbi
    File reference
    File reference_fai
    File reference_amb
    File reference_ann
    File reference_bwt
    File reference_pac
    File reference_sa
    File reference_dict
    Int? regionSize = 10000000
    File panelOfNormals
    File panelOfNormals_tbi
    File germlineResource
    File germlineResource_tbi
    Boolean? createCallRegions_equalize = true
  }
  call C.CreateCallRegions as createCallRegions {
    input:
      reference=reference,
      reference_fai=reference_fai,
      regionSize=select_first([regionSize, 10000000]),
      equalize=select_first([createCallRegions_equalize, true])
  }
  scatter (c in createCallRegions.regions) {
    call G.Gatk4Mutect2_cram as mutect2 {
      input:
        tumorBams=tumorBams,
        tumorBams_crai=tumorBams_crai,
        normalBams=normalBams,
        normalBams_crai=normalBams_crai,
        normalSample=normalName,
        reference=reference,
        reference_fai=reference_fai,
        reference_amb=reference_amb,
        reference_ann=reference_ann,
        reference_bwt=reference_bwt,
        reference_pac=reference_pac,
        reference_sa=reference_sa,
        reference_dict=reference_dict,
        germlineResource=germlineResource,
        germlineResource_tbi=germlineResource_tbi,
        intervals=c,
        panelOfNormals=panelOfNormals,
        panelOfNormals_tbi=panelOfNormals_tbi
    }
  }
  call B.bcftoolsConcat as concat {
    input:
      vcf=mutect2.out
  }
  call B2.bcftoolsIndex as indexUnfiltered {
    input:
      vcf=concat.out
  }
  call G2.Gatk4LearnReadOrientationModel as learn {
    input:
      f1r2CountsFiles=mutect2.f1f2r_out
  }
  call G3.Gatk4MergeMutectStats as mergeMutect2 {
    input:
      statsFiles=mutect2.stats
  }
  call G4.Gatk4GetPileupSummaries_cram as pileup {
    input:
      bam=tumorBams,
      bam_crai=tumorBams_crai,
      sites=biallelicSites,
      sites_tbi=biallelicSites_tbi,
      intervals=biallelicSites,
      reference=reference,
      reference_fai=reference_fai,
      reference_amb=reference_amb,
      reference_ann=reference_ann,
      reference_bwt=reference_bwt,
      reference_pac=reference_pac,
      reference_sa=reference_sa,
      reference_dict=reference_dict
  }
  call G5.Gatk4CalculateContamination as contamination {
    input:
      pileupTable=pileup.out
  }
  call G6.Gatk4FilterMutectCalls as filtering {
    input:
      contaminationTable=contamination.contOut,
      segmentationFile=contamination.segOut,
      statsFile=mergeMutect2.out,
      readOrientationModel=learn.out,
      vcf=indexUnfiltered.out,
      vcf_tbi=indexUnfiltered.out_tbi,
      reference=reference,
      reference_fai=reference_fai,
      reference_amb=reference_amb,
      reference_ann=reference_ann,
      reference_bwt=reference_bwt,
      reference_pac=reference_pac,
      reference_sa=reference_sa,
      reference_dict=reference_dict
  }
  call B3.bcftoolsNorm as normalise {
    input:
      vcf=filtering.out,
      reference=reference,
      reference_fai=reference_fai
  }
  call B2.bcftoolsIndex as indexFiltered {
    input:
      vcf=normalise.out
  }
  output {
    File out = indexFiltered.out
    File out_tbi = indexFiltered.out_tbi
  }
}

Common Workflow Language¶

#!/usr/bin/env cwl-runner
class: Workflow
cwlVersion: v1.2
label: Mutect2 joint somatic variant calling workflow (CRAM)
doc: |-
  This workflow uses the capability of mutect2 to call several samples at the same time and improve recall and accuracy through a joint model.
          Most of these tools are still in a beta state and not intended for main production (as of 4.1.4.0).
          There are also some major tweaks we have to make for runtime, as the amount of data might otherwise overwhelm the tools.

requirements:
- class: InlineJavascriptRequirement
- class: StepInputExpressionRequirement
- class: ScatterFeatureRequirement

inputs:
- id: normalBams
  doc: |-
    The bams that make up the normal sample. Generally Mutect will expect one bam per sample, but as long as the sample ids in the bam header are set appropriately, multiple bams per sample will work.
  type:
    type: array
    items: File
  secondaryFiles:
  - pattern: .crai
- id: tumorBams
  doc: |-
    The bams that contain the tumour samples. Generally Mutect will expect one bam per sample, but as long as the sample ids in the bam header are set appropriately, multiple bams per sample will work.
  type:
    type: array
    items: File
  secondaryFiles:
  - pattern: .crai
- id: normalName
  doc: |-
    The sample id of the normal sample. This id will be used to distinguish reads from this sample from all other samples. This id needs to be the one set in the bam header.
  type: string
- id: biallelicSites
  doc: |-
    A vcf of common biallelic sites from a population. This will be used to estimate sample contamination.
  type: File
  secondaryFiles:
  - pattern: .tbi
- id: reference
  doc: |-
    A fasta- and dict-indexed reference, which needs to be the reference the bams were aligned to.
  type: File
  secondaryFiles:
  - pattern: .fai
  - pattern: .amb
  - pattern: .ann
  - pattern: .bwt
  - pattern: .pac
  - pattern: .sa
  - pattern: ^.dict
- id: regionSize
  doc: |-
    The size of the regions over which to parallelise the analysis. This should be adjusted if there are lots of samples or a very high sequencing depth. Default: 10M bp.
  type: int
  default: 10000000
- id: panelOfNormals
  doc: |-
    The panel of normals, which summarises the technical and biological sites of errors. It is usually a good idea to generate this for your own cohort, but GATK suggests around 30 normals for this, so the panel GATK provides is usually a good choice.
  type: File
  secondaryFiles:
  - pattern: .tbi
- id: germlineResource
  doc: |-
    Vcf of germline variants. GATK provides this as well, but it can easily be substituted with the newest gnomAD (or similar) vcf.
  type: File
  secondaryFiles:
  - pattern: .tbi
- id: createCallRegions_equalize
  type: boolean
  default: true

outputs:
- id: out
  type: File
  secondaryFiles:
  - pattern: .tbi
  outputSource: indexFiltered/out

steps:
- id: createCallRegions
  label: Create genomic call regions
  in:
  - id: reference
    source: reference
  - id: regionSize
    source: regionSize
  - id: equalize
    source: createCallRegions_equalize
  run: tools/CreateCallRegions_v0_1_0.cwl
  out:
  - id: regions
- id: mutect2
  label: GatkMutect2
  in:
  - id: tumorBams
    source: tumorBams
  - id: normalBams
    source: normalBams
  - id: normalSample
    source: normalName
  - id: reference
    source: reference
  - id: germlineResource
    source: germlineResource
  - id: intervals
    source: createCallRegions/regions
  - id: panelOfNormals
    source: panelOfNormals
  scatter:
  - intervals
  run: tools/Gatk4Mutect2_cram_4_1_8_1.cwl
  out:
  - id: out
  - id: stats
  - id: f1f2r_out
  - id: bam
- id: concat
  label: 'BCFTools: Concat'
  in:
  - id: vcf
    source: mutect2/out
  run: tools/bcftoolsConcat_v1_9.cwl
  out:
  - id: out
- id: indexUnfiltered
  label: 'BCFTools: Index'
  in:
  - id: vcf
    source: concat/out
  run: tools/bcftoolsIndex_v1_9.cwl
  out:
  - id: out
- id: learn
  label: 'GATK4: LearnReadOrientationModel'
  in:
  - id: f1r2CountsFiles
    source: mutect2/f1f2r_out
  run: tools/Gatk4LearnReadOrientationModel_4_1_8_1.cwl
  out:
  - id: out
- id: mergeMutect2
  label: 'GATK4: MergeMutectStats'
  in:
  - id: statsFiles
    source: mutect2/stats
  run: tools/Gatk4MergeMutectStats_4_1_8_1.cwl
  out:
  - id: out
- id: pileup
  label: 'GATK4: GetPileupSummaries'
  in:
  - id: bam
    source: tumorBams
  - id: sites
    source: biallelicSites
  - id: intervals
    source: biallelicSites
  - id: reference
    source: reference
  run: tools/Gatk4GetPileupSummaries_cram_4_1_8_1.cwl
  out:
  - id: out
- id: contamination
  label: 'GATK4: CalculateContamination'
  in:
  - id: pileupTable
    source: pileup/out
  run: tools/Gatk4CalculateContamination_4_1_8_1.cwl
  out:
  - id: contOut
  - id: segOut
- id: filtering
  label: 'GATK4: GetFilterMutectCalls'
  in:
  - id: contaminationTable
    source: contamination/contOut
  - id: segmentationFile
    source: contamination/segOut
  - id: statsFile
    source: mergeMutect2/out
  - id: readOrientationModel
    source: learn/out
  - id: vcf
    source: indexUnfiltered/out
  - id: reference
    source: reference
  run: tools/Gatk4FilterMutectCalls_4_1_8_1.cwl
  out:
  - id: out
- id: normalise
  label: 'BCFTools: Normalize'
  in:
  - id: vcf
    source: filtering/out
  - id: reference
    source: reference
  run: tools/bcftoolsNorm_v1_9.cwl
  out:
  - id: out
- id: indexFiltered
  label: 'BCFTools: Index'
  in:
  - id: vcf
    source: normalise/out
  run: tools/bcftoolsIndex_v1_9.cwl
  out:
  - id: out
id: Mutect2JointSomaticWorkflowCram